OpenVar: functional annotation of variants in non-canonical open reading frames

Background Recent technological advances have revealed thousands of functional open reading frames (ORF) that have eluded reference genome annotations. These overlooked ORFs are found throughout the genome, in any reading frame of transcripts, mature or non-coding, and can overlap annotated ORFs in a different reading frame. The exploration of these novel ORFs in genomic datasets and of their role in genetic traits is hindered by a lack of software. Results Here, we present OpenVar, a genomic variant annotator that mends that gap and fosters meaningful discoveries. To illustrate the potential of OpenVar, we analysed all variants within SynMicDB, a database of cancer-associated synonymous mutations. By including non-canonical ORFs in the analysis, OpenVar yields a 33.6-fold, 13.8-fold and 8.3-fold increase in high impact variants over Annovar, SnpEff and VEP respectively. We highlighted an overlapping non-canonical ORF in the HEY2 gene where variants significantly clustered. Conclusions OpenVar integrates non-canonical ORFs in the analysis of genomic variants, unveiling new research avenues to better understand the genotype–phenotype relationships.


Background
Technological advances have recently enabled both routine whole-genome sequencing and a deeper understanding of the coding potential of genomes. On the former, with the introduction of Next-Generation Sequencing techniques, reading one's genome has never been so reliable and affordable, still a fourth of Mendelian phenotypes remain without a known molecular basis (www. omim. org/ stati stics). On the latter, the emergence of ribosome profiling (Ribo-seq) has unveiled the unsuspected depth of the coding potential of eukaryotic genomes [1]. As Ribo-seq detects mRNA translation events, it revealed signatures outside of annotated open reading frames (ORFs) defined by reference gene annotations [2]. Yet, these annotations are used worldwide in fundamental and clinical research, providing the necessary dictionary to read a genome. Now aware that thousands of functional alternative ORFs have eluded annotation [3][4][5][6], we need to assess their role in diseases [7][8][9].
Serendipitous discoveries have highlighted likely pathological variants in currently unannotated ORFs [7,8,10]. However, most studies focused on upstream ORFs [8], partly because no tool is currently available to evaluate variants falling in dual-coding regions where two ORFs overlap. This lack of software drastically impedes the discoveries and interpretation of disease-linked variants in these regions [10].
Therefore, we developed OpenVar (https:// openp rot. org/ openv ar/), an open source tool to annotate and explore the impact of genomic variants using deeper genome annotations, such as OpenProt [3]. Genomic variants encompass single nucleotide polymorphisms (SNP), insertions and deletions (INDELs), and multiple nucleotide polymorphisms (MNP). OpenVar annotates such genomic variants both based on their locus (e.g. upstream/downstream, intronic, etc.) and their effect on all coding sequences annotated in OpenProt (e.g. amino acid substitution, stop gain/loss, etc.). OpenVar is available as a python package and an easy-to-use web-based platform, facilitating its implementation independently of one's computing knowledge or resources.

Results
OpenVar adopts the widely used SnpEff code [11] and the deep genome annotation OpenProt [3] to annotate a variant calling file (VCF). To handle multiple ORFs on a single transcript, OpenVar combines the transcript and protein accession as a unique identifier. This strategy allows the visualization, for a single variant, of all consequences across every transcript and every ORF it affects (Fig. 1A). Alongside the common annotated VCF format, OpenVar outputs a table listing all effects for each variant submitted in the initial VCF. This table is named "annOn-ePerLine.tsv" as it contains one line for each effect of each variant. It lists the chromosome and position of the variant, the reference and alternative alleles, the description of the effect of the variant, the impact category, the transcript and protein accession, alongside the gene name. To ease interpretation of a much more complex annotation, OpenVar also outputs a table, named "max_impact.tsv", listing the most affected canonical and non-canonical ORFs for each variant. The prioritization of effects is based on the predicted impact of the variant on the ORF of interest. Impacts are categorized as "modifier", "low", "moderate" and "high". This classification is identical to that used in the SnpEff algorithm [11]. For a non-annotated ORF to be reported in the maximal impact table, a given variant must have a greater or identical impact on it than on a canonical ORF. OpenVar always reports the effect on canonical ORFs in each of its outputs. Thus, the effect of a given variant on a canonical ORF is always visible, but users also see the effect on any alternative ORF (Fig. 1A).
By offering deep annotation of genomic variants, OpenVar can help make meaningful discoveries. Here, we have used it to explore a specific class of variants in cancers. We took advantage of the recently published SynMic database [12]. The database lists variants leading to synonymous mutations in canonical proteins across 18,028 samples from 88 different tumor entities. Interestingly, synonymous mutations are the second most frequent in cancer samples behind missense mutations [12]. The mechanisms behind their pathological impact remain unknown for most synonymous mutations. Current theories revolve around splicing junctions and transcript structure and/or stability [10,13]. However, some of these variants may also have a greater impact on an overlapping ORF as recently demonstrated [7].
To illustrate the potential of OpenVar, we analysed all the SynMicDB variants with it and we highlighted an overlapping alternative ORF in the HEY2 gene. 26 mutations where reported within HEY2, all falling within the union of the canonical ORF (Q9UBP5) and an alternative ORF (IP_145210) detected with a unique peptide by mass spectrometry on the OpenProt resource [3]. 19 mutations (73.1%) fell exclusively in the overlapping region, when the expected value was of 10.07 ± 2.48 (38.7%), yielding a z-score of 3.60 (Fig. 1B). Additionally, we retrieved mutations within HEY2 listed in the COSMIC catalog. 243 mutations were reported within HEY2, with 157 falling within the union of the canonical and the alternative ORFs. 83 mutations (52.9%) fell exclusively in the overlapping region, when the expected value was of 60.81 ± 6.10 (38.7%), yielding a z-score of 3.64 (Fig. 1B). Furthermore, looking at mutations synonymous for Q9UBP5, the COSMIC dataset contained 55 synonymous variants. Out of these, 32 variants (58.2%) clustered on the IP_145210 ORF. The expected value was 21.3 ± 3.61 (38.7%), yielding a z-score of 2.96. Hence, both datasets present a significant enrichment of genetic variants at the locus of the alternative protein IP_145210 within HEY2.
To highlight the impact of including alternative ORFs in analyses of genomic variants, we compared the annotation of the SynMicDB and the COSMIC HEY2 datasets when analysed with OpenVar or the most common annotators: the Ensembl Variant Effect Predictor (VEP) [14], Annovar [15] and SnpEff [11] (Fig. 2A,  B). Although small differences were observed between VEP, Annovar and SnpEff, none predicted as many high impact variants than OpenVar. As the SynMicDB dataset is a database of synonymous mutations, most variants are classified as low impact with VEP, Annovar or SnpEff ( Fig. 2A). By simply considering non-canonical ORFs overlapping the annotated ORF, OpenVar reclassify many low impact variants as high impact, yielding a 33.6-fold, 13.8-fold and 8.3-fold increase over Annovar, SnpEff and VEP respectively. Similarly, when considering a more heterogeneous set of variants with the COSMIC HEY2 dataset, OpenVar offers a twofold increase over Annovar and a 1.6-fold increase over SnpEff and VEP in high impact variants (Fig. 2B). With both datasets, the increase in moderate and high impact variants observed with OpenVar comes from the reclassification of modifier and low impact variants, as visible on Fig. 2 with a relative decrease in the latter with OpenVar ( Fig. 2A, B). For example, in the HEY2 COSMIC dataset, the variant 6:g.125,759,806 T > G is located 4 nucleotides after the stop codon of the   ) and is thus classified as a "moderate" impact on IP_145210 and a "modifier" impact on Q9UBP5 by OpenVar. Since the predicted impact is higher on the alternative ORF, this variant is counted as "moderate" impact on the general statistics presented on Fig. 2 (Fig. 2B). When using OpenVar, the effect predicted on canonical ORFs is reported. For example, in the COSMIC dataset for the HEY2 gene, two variants are annotated as yielding a premature stop codon within the canonical ORF (Q9UBP5). These two variants chr6:125,759,167 C > T and chr6:125,759,503 C > T are the only two variants predicted as 'stop_gained' in the COSMIC catalog and with VEP, Annovar and SnpEff. With OpenVar, 4 variants are reported with a stop-gain effect: the 2 aforementioned variants for their effect on the canonical ORF, and 2 others yielding a premature stop codon in the alternative ORF (IP_145210): chr6:125,759,711 C > T and chr6:125,759,780 G > T.  Interestingly, when looking at the relative impact of variants on HEY2 (Fig. 2C), OpenVar predicts higher impact variants towards the carboxyl end of the annotated ORF (Q9UBP5 protein), which corresponds exactly at the position of the alternative ORF (IP_145210 protein). The IP_145210 ORF in HEY2 does not overlap with any known domain of the well-characterized Q9UBP5 protein. Q9UBP5 is a 337 amino acid long Hairy-related basic helix-loop-helix (bHLH) transcription repressor, but its functional domains span from amino acid 48 to 116 [16] (Fig. 2C). Meanwhile, the IP_145210 ORF overlaps the disordered carboxyl tail of the Q9UBP5 ORF. This observation agrees with previous reports suggesting intrinsically disordered regions are prone to host dual-coding events [7,17,18]. Thus, the results highlighted by OpenVar suggest these variants may be detrimental via their consequence on IP_145210 rather than the canonical Q9UBP5.

Discussion
In our view, exploring the effect of variants on noncanonical ORFs helps understanding the complex genotype-phenotype relationships [10]. Clearly, further research is needed to functionally characterize these overlooked ORFs detected by Ribo-seq and/or mass spectrometry. Deep annotation of the ever-growing collection of sequencing datasets will help target and guide their functional characterization.
Although we believe that public genomic datasets can inform on functional ORFs currently not annotated, it does not eclipse the need for further analyses and experimentation. When interpreting the results from OpenVar, users must consider the uncertain nature of unannotated ORFs and existing biases in sequencing datasets (e.g. panels limited to currently defined coding exons).
OpenVar does not aim at prioritizing and filtering genomic variants from large sequencing data, which can be done using publicly available frameworks, such as KGGSeq [19] or FATHMM [20]. These frameworks heavily rely on prior knowledge on the genomic locus and encoded protein, which is detrimental to the identification of variants within non-annotated ORFs. Because these ORFs are not annotated, there is a critical lack of biological knowledge. However, we encourage the use of OpenVar concomitantly to such framework and emphasize the need for caution when interpreting the results from OpenVar.
We welcome user feedback and feature requests through the OpenVar GitHub repositories for the python package (https:// github. com/ MAB-Lab/ OpenV ar) and the web-based implementation (https:// github. com/ MAB-Lab/ OpenV ar_ WebApp). Such feedback helps us to enhance and tailor OpenVar to the needs of the community. The OpenVar repository includes a detailed guide to installation and use, as well as a walk-through demonstration using provided example datasets. The OpenVar web-based platform repository contains easy-to-follow instructions to submit a dataset for analysis. Additionally, the repository includes detailed explanation of all output statistics, figures and files.

Conclusions
In summary, thousands of ORFs have eluded reference annotations as recently shown by ribosome profiling and proteogenomics studies [6,10,21,22]. OpenVar integrates such non-canonical ORFs in the analysis of genomic variants, unveiling new research avenues to understand the genotype-phenotype relationships.

Methods
Here, we present a novel tool, OpenVar, to annotate and explore the impact of genomic variants using deeper genome annotations, such as OpenProt [3]. OpenVar is available as a python package and an easy-to-use webbased platform, facilitating its implementation independently of one's computing knowledge or resources. To perform an analysis with OpenVar, one simply needs a curated variant calling file (VCF).

Data availability
OpenVar, like OpenProt, embraces the philosophy of data and results sharing to foster scientific discoveries. Thus, the OpenVar github repository (https:// github. com/ MAB-Lab/ OpenV ar) allows for download of the SynMicDB and COSMIC (HEY2 gene) datasets that were used in this article. The SynMicDB dataset was downloaded at http:// synmi cdb. dkfz. de/ rsynm icdb/. The COS-MIC dataset was downloaded at https:// cancer. sanger. ac. uk/ cosmic. The following links can be used to download OpenVar results on the SynMicDB dataset when using the Ensembl annotation (SynMic_ Ensem bl), the NCBI RefSeq annotation (SynMic_ RefSeq), the Open-Prot-Ensembl annotation (SynMic_ OP-Ensem bl) or the OpenProt-RefSeq annotation (SynMic_ OP-RefSeq). The following link can be used to download OpenVar results on the COSMIC-HEY2 dataset using the OpenProt-Ensembl annotation (COSMIC_ OP).

OpenVar output
OpenVar output contains 10 files. The output folder will contain the submitted variant calling file (VCF) as well as a text file with all analysis warning (warnings.txt). If a liftover between genome builds was required, Open-Var uses the Picard LiftoverVcf tool (GATK tools, Broad Institute) and the output folder will also contain all outputs files from the liftover. OpenVar generates a common annotated VCF file and two table files. The table  named '  To ease the filtering, impact categories are numbered from 0 to 3 corresponding respectively to "modifier", "low", "moderate" and "high". The following columns list the transcript and proteins accessions for both the canonical and alternative ORFs alongside the DNA and protein changes. Additionally, OpenVar also outputs all the figures that can be found on the results page in a scalable vector graphics (SVG) format to ease their customization and inclusion in publications. A pickle object is also included ('summary.pkl'). It contains all summary statistics needed to display the results on the web page in a python format. It can easily be loaded into any python script for downstream analysis with the following command: pickle.load(open('summary.pkl' , 'rb')). Please note you would need to import the pickle package first with import pickle.

Statistical analysis of genomic variant cluster
The number of mutation expected to fall within the overlapping coding regions was calculated as a binomial distribution: X ∼ B(n, p) , where X is the number of mutations falling within the overlapping coding region amongst a set of n mutations; and p the probability for any mutation to fall within the overlapping region. We assume that (1) a mutation can happen equally on any nucleotide, and that (2) each mutations are independent. Under these assumptions, we can define: p = length(seq 1 ∩seq 2 ) length(seq 1 ∪seq 2 ) , where seq 1 and seq 2 are the two coding sequences. The significance of the observed value was assessed by measuring its z-score with regards to the mean and standard deviation of the predicted binomial distribution. A z-score over 2.58 is considered significant (p < 0.01).

Relative impact of variants map
Each impact potential at a genomic position x is calculated as the mean of impacts from all variants in the interval x-30 bp and x + 30 bp. The maps for all annotator were normalized to the maximal impact obtained between VEP, Annovar and SnpEff.